Obfuscating Document Stylometry to Preserve Author Anonymity

Authors

  • Gary Kacmarcik
  • Michael Gamon
Abstract

This paper explores techniques for reducing the effectiveness of standard authorship attribution techniques so that an author A can preserve anonymity for a particular document D. We discuss feature selection and adjustment and show how this information can be fed back to the author to create a new document D’ for which the calculated attribution moves away from A. Since it can be labor-intensive to adjust the document in this fashion, we attempt to quantify the amount of effort required to produce the anonymized document and introduce two levels of anonymization: shallow and deep. In our test set, we show that shallow anonymization can be achieved by making 14 changes per 1000 words to reduce the likelihood of identifying A as the author by an average of more than 83%. For deep anonymization, we adapt the unmasking work of Koppel and Schler to provide feedback that allows the author to choose the level of anonymization.
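
As a rough illustration of the feedback idea in the abstract, the Python sketch below compares a document's word rates against a target profile and reports how many occurrences of each word an author might add or remove. The feature words, the target rates, and the function names (`tokenize`, `suggest_changes`, `changes_per_1000`) are hypothetical and chosen only for illustration. The abstract does not specify the paper's actual feature selection or attribution model, so this should not be read as the authors' method.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def suggest_changes(text, target_rates):
    """Return signed edit counts per feature word.

    target_rates maps a word to a desired rate per 1000 tokens
    (hypothetical values, e.g. an average over other authors).
    A positive count means "add occurrences", negative means "remove".
    The document length is treated as fixed, which is an approximation.
    """
    tokens = tokenize(text)
    counts = Counter(tokens)
    n = len(tokens)
    suggestions = {}
    for word, rate in target_rates.items():
        desired = round(rate * n / 1000.0)
        delta = desired - counts[word]
        if delta != 0:
            suggestions[word] = delta
    return suggestions


def changes_per_1000(text, suggestions):
    """Total number of suggested edits, normalized per 1000 tokens."""
    n = len(tokenize(text))
    if n == 0:
        return 0.0
    return 1000.0 * sum(abs(d) for d in suggestions.values()) / n


if __name__ == "__main__":
    # Toy 1000-token document with made-up word frequencies.
    doc = " ".join(["the"] * 30 + ["of"] * 10 + ["word"] * 960)
    target = {"the": 25.0, "of": 14.0}  # hypothetical target profile
    edits = suggest_changes(doc, target)
    print(edits, changes_per_1000(doc, edits))
```

A per-1000-word edit count like the one returned by `changes_per_1000` is one simple way to express the kind of effort measure the abstract reports for shallow anonymization.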

Similar resources

SU@PAN'2016: Author Obfuscation

The anonymity of a text’s writer is an important topic for some domains, such as witness protection and anonymity programs. Stylometry can be used to reveal the true author of a text even if s/he wishes to hide his/her identity. In this paper, we present our approach for hiding an author’s identity by masking their style, which we developed for the Author Obfuscation task, part of the PAN-2016 ...

A Study on Author Identification through Stylometry

Electronic communication is one of the most popular means of communication in this era, and e-mail is its most popular form. The Internet serves as the backbone for these communications. In digital forensics, the question arises whether the authors of documents, and their identity and demographic background, are linked to other documents. So identification of the au...

Slavonic Corpus for Stylometry Research

Stylometry techniques such as authorship recognition, machine translation detection and pedophile identification are used daily in applications for the most widely used languages. But under-represented languages lack data sources usable for stylometry research. In this paper, we propose an algorithm to build corpora containing meta-information required for stylometry experiments (author informa...

Document Author Classification using Generalized Discriminant Analysis

Classification by document authorship based on statistical analysis — stylometry — is considered here by using feature vectors obtained from counts of all words in the intersecting sets of the training data. This differs from some previous stylometry, which used only selected “noncontextual” words with the highest counts, and also from conventional text search techniques, where noncontextual wo...

Caging the Muse: Metrics for Unconscious Author Markers

Stylometry is usually concerned with finding an authorial invariant, and attempts at authorship identification often model authors based on topics, putting weight on consciously selected content words and their unconsciously controlled frequency. This paper presents two uses of stylometry that rely on factors beyond the control of the author: document dating to a given historical period by dete...

Publication date: 2006